What’s up with reproducibility in R?
Things I want to talk about
- Identify what must be managed for reproducibility
- Learn about tools that make your projects reproducible ({renv}, Docker, Nix)
- What I will not discuss (but is very useful!):
- FP, Git, Documenting, testing and packaging code, build automation with {targets}
What I mean by reproducibility
- Ability to recover exactly the same results from an analysis
- Why would you want that?
- Auditing purposes
- Updating the data (results should change only because the data changed)
- Reproducibility as a cornerstone of science
- (Work on an immutable dev environment)
- “But if I have the original script and data, what’s the problem?”
Reproducibility is on a continuum (1/2)
Here are the 4 main things influencing an analysis’ reproducibility:
- Version of R used
- Versions of packages used
- Operating system
- Hardware
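Most of these items can be inspected from within R itself. The following is only a minimal sketch using base R; the name environment_record is made up for this illustration, and hardware is only partially visible from R (here, just the machine architecture).

```r
# Collect what base R can tell us about the current environment.
session <- utils::sessionInfo()
system_info <- Sys.info()

environment_record <- list(
  r_version  = R.version.string,          # version of R used
  os         = system_info[["sysname"]],  # operating system name
  os_release = system_info[["release"]],  # operating system release
  machine    = system_info[["machine"]],  # CPU architecture
  # versions of the packages attached with library(), if any
  packages   = vapply(
    session$otherPkgs,
    function(pkg) pkg$Version,
    character(1)
  )
)

str(environment_record)
```

This only describes the environment; it does not restore it, which is what the tools discussed later are for.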
Reproducibility is on a continuum (2/2)
Source: Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27
The problem
Works on my machine!
We’ll ship your computer then.
A typical project’s setup
- Our project: housing in Luxembourg
- Data to analyse: vente-maison-2010-2021.xlsx in the data folder
- 2 scripts to analyse data (in the scripts/project_start folder):
- One to scrape the Excel file: save_data.R
- One to analyse the data: analysis.R
Project start - What’s wrong with these scripts?
- The first two scripts -> script-based workflow
- Just a long series of calls
- No functions
- difficult to re-use!
- difficult to test!
- difficult to parallelise!
- lots of repetition (plots)
- Usually we want a report not just a script
- No record of the package versions, nor of the R version, used
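To illustrate the point about repetition, here is a small sketch of how repeated plotting code could be wrapped in a function. The data frame and the names prices and plot_locality are hypothetical and are not taken from the project's scripts.

```r
# Hypothetical toy data standing in for the housing prices.
prices <- data.frame(
  year     = rep(2010:2012, times = 2),
  locality = rep(c("A", "B"), each = 3),
  price    = c(100, 105, 112, 200, 198, 210)
)

# One function instead of one copy-pasted plotting block per locality.
plot_locality <- function(data, loc) {
  rows <- data[data$locality == loc, ]
  plot(rows$year, rows$price, type = "l",
       main = loc, xlab = "Year", ylab = "Price")
  invisible(rows)  # return the plotted rows so the function can be tested
}

pdf(NULL)  # null graphics device: no window or file when run as a script
plotted <- lapply(unique(prices$locality),
                  function(loc) plot_locality(prices, loc))
dev.off()
```

Because plot_locality() takes its inputs as arguments and returns a value, it can be reused, tested and applied in parallel more easily than a long series of calls.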
Making our scripts reproducible
We need to answer these questions
- How easy would it be for someone else to rerun the analysis?
- How easy would it be to update the project?
- How easy would it be to reuse this code for another project?
- What guarantee do we have that the output is stable through time?
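One rough way to check the last question is to compare a checksum of the output across runs. The sketch below uses only base R; run_analysis() is a hypothetical stand-in for a real analysis, not the project's code.

```r
# Hypothetical stand-in for an analysis: fixed seed, result written to disk.
run_analysis <- function(path) {
  set.seed(1234)                      # fix the random number generator state
  x <- rnorm(100)
  result <- data.frame(mean = mean(x), sd = sd(x))
  write.csv(result, path, row.names = FALSE)
  unname(tools::md5sum(path))         # checksum of the output file
}

checksum_first  <- run_analysis(tempfile(fileext = ".csv"))
checksum_second <- run_analysis(tempfile(fileext = ".csv"))

# Do both runs produce byte-identical output?
identical(checksum_first, checksum_second)
```

Storing such a checksum next to the results makes it possible to detect later, after an R or package upgrade, whether the output has silently changed.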
The easiest, cheapest thing you should do
- Record the packages and the R version used with {renv}
Recording packages and R version used
Create a renv.lock file in 2 steps!
- Open an R session in the folder containing the scripts
- Run renv::init() and check the folder for renv.lock
(renv::init() will take some time to run the first time)
What an renv.lock file looks like
{
  "R": {
    "Version": "4.2.2",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.rstudio.com/all/latest"
      }
    ]
  },
  "Packages": {
    "MASS": {
      "Package": "MASS",
      "Version": "7.3-58.1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "762e1804143a332333c054759f89a706",
      "Requirements": []
    },
    "Matrix": {
      "Package": "Matrix",
      "Version": "1.5-1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "539dc0c0c05636812f1080f473d2c177",
      "Requirements": [
        "lattice"
      ]
***and many more packages***
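Since the lockfile is plain JSON, it can be inspected like any other JSON file. The sketch below assumes the {jsonlite} package is installed and writes a shortened, hypothetical lockfile to a temporary file so that it runs on its own; with a real project, lock_path would point to the project's renv.lock.

```r
library(jsonlite)  # assumption: {jsonlite} is installed

# A shortened, hypothetical lockfile with the same structure as above.
lock_path <- tempfile(fileext = ".lock")
writeLines(c(
  '{',
  '  "R": { "Version": "4.2.2" },',
  '  "Packages": {',
  '    "MASS":   { "Package": "MASS",   "Version": "7.3-58.1" },',
  '    "Matrix": { "Package": "Matrix", "Version": "1.5-1" }',
  '  }',
  '}'
), lock_path)

lock <- fromJSON(lock_path)

# The recorded R version
lock$R$Version

# A named vector of recorded package versions
package_versions <- vapply(lock$Packages,
                           function(pkg) pkg$Version,
                           character(1))
package_versions
```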
Restoring a library using an renv.lock file
The renv.lock file is not just a record
- It can be used to restore the library as well!
- Run renv::restore() (answer Y to activate the project when asked)
{renv} conclusion
Shortcomings:
- Records, but does not restore the version of R
- Installation of old packages can fail (due to missing OS-dependencies)
Advantages:
- Generating an renv.lock file is “free”
- Provides a blueprint for dockerizing our pipeline
- Creates a project-specific library (no interferences)
Where are we in the continuum?
- Package and R versions are recorded
- Packages can be restored (but not always!)
- In conclusion: project-specific libraries of packages are not enough
Let’s add a layer: Docker
Remember the problem: works on my machine?
It turns out we can solve this by shipping the whole computer, using Docker.
What is Docker
- Docker is a containerisation tool that you install on your computer
- Docker allows you to build images and run containers (a container is an instance of an image)
- Docker images:
- contain all the software and code needed for your project
- are immutable (cannot be changed at run-time)
- can be shared on- and offline
A word of warning
- Docker works best on Linux and macOS (Docker images are built on top of Linux distributions)
- It can run on Windows, but you need to enable virtualisation options in the BIOS and set up WSL2
- Properly dockerizing a project requires practice (but current LLMs are really helpful with this)
“Hello, Docker!”
- Start by creating a so-called Dockerfile
- Dockerfile = recipe for an image
- Build the image:
docker build -t hello .
- Run a container:
docker run --rm --name hello_container hello
--rm: remove the container after running
--name some_name: name your container some_name
Without Docker
With Docker
Dockerizing a project (1/2)
- At image build-time:
- install R (or use an image that ships R)
- install packages (using our renv.lock file)
- copy all scripts to the image
- run the analysis using targets::tar_make()
- At container run-time:
- copy the outputs of the analysis from the container to your computer
- possible to “log in” to a running container to inspect code and outputs
Dockerizing a project (2/2)
- The built image can be shared, or only the Dockerfile (and users can then rebuild the image)
- The outputs will always stay the same!
- Working interactively using Docker can be challenging though
The Rocker project
- Possible to build new images from other images
- The Rocker project provides many images with R, RStudio, Shiny, and other packages pre-installed
- We will use the Rocker “r-ver” images, which are made specifically for reproducibility
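To make the build-time steps concrete, here is a rough sketch that generates a Dockerfile from R on top of an r-ver image. The helper make_dockerfile(), the /project folder and the layout (an renv.lock file and a scripts/ folder containing a {targets} pipeline next to the Dockerfile, with {targets} listed in the lockfile) are assumptions for illustration, not the project's actual setup.

```r
# Build the lines of a Dockerfile for a given R version.
# Assumed layout: renv.lock and scripts/ (with a _targets.R file)
# sit next to the Dockerfile; all paths are illustrative only.
make_dockerfile <- function(r_version) {
  c(
    paste0("FROM rocker/r-ver:", r_version),   # image that ships R
    "RUN R -e \"install.packages('renv')\"",   # tool used to restore packages
    "WORKDIR /project",
    "COPY renv.lock renv.lock",
    "RUN R -e \"renv::restore()\"",            # install the recorded packages
    "COPY scripts/ .",                         # copy the scripts to the image
    "RUN R -e \"targets::tar_make()\""         # run the analysis at build time
  )
}

dockerfile_lines <- make_dockerfile("4.2.2")
dockerfile_path <- file.path(tempdir(), "Dockerfile")
writeLines(dockerfile_lines, dockerfile_path)

cat(dockerfile_lines, sep = "\n")
```

The R version passed to make_dockerfile() would normally be the one recorded in renv.lock, so that the image matches the environment the analysis was developed in.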
Docker: a panacea?
- Docker is very useful and widely used
- But the entry cost is high
- Single point of failure (what happens if Docker gets bought, abandoned, etc.? Quite unlikely, though)
- Not actually dealing with reproducibility per se, we’re “abusing” Docker in a way
The Nix package manager (1/2)
Package manager: tool to install and manage packages
Package: any piece of software (not just R packages)
A popular package manager:
The Nix package manager (2/2)
- Reproducibility: R, R packages and other dependencies must be managed
- Nix is a package manager actually focused on reproducible builds
- Nix deals with everything, with one single text file (called a Nix expression)!
- These Nix expressions always build the exact same output
A basic Nix expression (1/6)
let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
  system_packages = builtins.attrValues {
    inherit (pkgs) R ;
  };
in
  pkgs.mkShell {
    buildInputs = [ system_packages ];
    shellHook = "R --vanilla";
  }
There’s a lot to discuss here!
A basic Nix expression (2/6)
- Written in the Nix language (not discussed)
- Defines the repository to use (with a fixed revision)
- Lists packages to install
- Defines the output: a development shell
A basic Nix expression (3/6)
- Software for Nix is defined in a mono-repository of tens of thousands of expressions, hosted on GitHub
- Because it lives on GitHub, we can use any commit to pin package versions for reproducibility!
- For example, the following commit installs R 4.3.1 and associated packages:
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
A basic Nix expression (4/6)
system_packages: a variable that lists software to install
- In this case, only R:
system_packages = builtins.attrValues {
  inherit (pkgs) R ;
};
A basic Nix expression (5/6)
- Finally, we define a shell:
pkgs.mkShell {
  buildInputs = [ system_packages ];
  shellHook = "R --vanilla";
}
- This shell will come with the software defined in system_packages (buildInputs)
- And launch R --vanilla when started (shellHook)
A basic Nix expression (6/6)
- Writing these expressions requires learning a new language
- While incredibly powerful, if all we want are per-project reproducible dev shells…
- …then {rix} will help!
Nix expressions
- Nix expressions can be used to install software
- But we will use them to build per-project development shells
- These shells can include R, LaTeX packages, Quarto, Python, Julia…
- Nix takes care of installing every dependency down to the compiler!
CRAN and Bioconductor
- CRAN is the repository of R packages to extend the language
- At the time of writing, more than 20,000 packages are available
- Bioconductor: a repository with a focus on bioinformatics, with more than 2,000 additional packages
- Almost all are available through nixpkgs, in the rPackages set!
- Find packages here
rix: reproducible development environments with Nix (1/5)
- {rix} makes writing Nix expressions easy!
- Simply use the provided rix() function:
library(rix)

rix(
  r_ver = "4.3.1",
  # date = "2025-01-27",  # a date also works
  r_pkgs = c("dplyr", "ggplot2"),
  system_pkgs = NULL,
  git_pkgs = NULL,
  tex_pkgs = NULL,
  ide = "code",
  project_path = "."
)
rix: reproducible development environments with Nix (2/5)
renv.lock files can also be used as starting points:
library(rix)

renv2nix(
  renv_lock_path = "path/to/original/renv_project/renv.lock",
  project_path = "path/to/rix_project",
  override_r_ver = "4.4.1"  # optional
)
rix: reproducible development environments with Nix (3/5)
- List required R version and packages
- Optionally: more system packages, packages hosted on GitHub, or LaTeX packages
- Optionally: an IDE (RStudio, Radian, VS Code or “other”)
- Work interactively in an isolated, project-specific and reproducible environment!
rix: reproducible development environments with Nix (4/5)
- rix::rix() generates a default.nix file
- Build expressions using nix-build (in a terminal) or rix::nix_build() from R
- “Drop” into the development environment using nix-shell
- Expressions can be generated even without Nix installed
rix: reproducible development environments with Nix (5/5)
- Can install specific versions of packages (write "dplyr@1.0.0")
- Can install packages hosted on GitHub
- Many vignettes to get you started! See here
Let’s check out scripts/nix_expressions/rix_intro/
Non-interactive use
- {rix} makes it easy to run pipelines in the right environment
- (Little side note: the best tool to build pipelines in R is {targets})
- See scripts/nix_expressions/nix_targets_pipeline
- Can also run the pipeline like so:
cd /absolute/path/to/pipeline/ && nix-shell default.nix --run "Rscript -e 'targets::tar_make()'"
Nix and GitHub Actions: running pipelines
- It is easy to run a {targets} pipeline on GitHub Actions
- Simply run rix::tar_nix_ga() to generate the required files
- Commit and push, and watch the actions run!
- See here.
Nix and GitHub Actions: writing papers
- Easy collaboration on papers as well
- See here
- Just focus on writing!
Subshells
- Also possible to evaluate single functions inside a “subshell”
- Works from R installed via Nix or not!
- Very useful for working with hard-to-install packages such as {arrow}
- See scripts/nix_expressions/subshell
R packages release cycle
- CRAN is updated daily, but these updates are not reflected in nixpkgs right away
- The rPackages set gets updated around new R releases (every 3 months or so)
- What if more recent packages are required?
- One solution: use the nixpkgs fork from our rstats-on-nix organisation!
- See scripts/nix_expressions/bleeding
Conclusion
- A vast and complex topic!
- At the very least, generate an renv.lock file
- Always possible to rebuild a Docker image in the future (either you, or someone else!)
- Consider using {targets}: not only good for reproducibility, but also an amazing tool all around
- Long-term reproducibility: must use Docker or Nix (better: both!) and maintenance effort is required as well
The end
Contact me if you have questions:
- bruno@brodrigues.co
- Twitter: @brodriguesco
- Mastodon: @brodriguesco@fosstodon.org
- Blog: www.brodrigues.co
- Book: www.raps-with-r.dev